An Iterative Co-Saliency Framework for RGBD Images
As a newly emerging and significant topic in the computer vision community,
co-saliency detection aims at discovering the common salient objects in
multiple related images. The existing methods often generate the co-saliency
map through a direct forward pipeline which is based on the designed cues or
initialization, but lack the refinement-cycle scheme. Moreover, they mainly
focus on RGB images and ignore the depth information for RGBD images. In this
paper, we propose an iterative RGBD co-saliency framework, which utilizes the
existing single saliency maps as the initialization, and generates the final
RGBD co-saliency map by using a refinement-cycle model. Three schemes are
employed in the proposed RGBD co-saliency framework, which include the addition
scheme, deletion scheme, and iteration scheme. The addition scheme is used to
highlight the salient regions based on intra-image depth propagation and
saliency propagation, while the deletion scheme filters the saliency regions
and removes the non-common salient regions based on inter-image constraints. The
iteration scheme is proposed to obtain a more homogeneous and consistent
co-saliency map. Furthermore, a novel descriptor, named depth shape prior, is
proposed in the addition scheme to introduce the depth information to enhance
identification of co-salient objects. The proposed method can effectively
exploit any existing 2D saliency model to work well in RGBD co-saliency
scenarios. The experiments on two RGBD co-saliency datasets demonstrate the
effectiveness of our proposed framework.
Comment: 13 pages, 13 figures, Accepted by IEEE Transactions on Cybernetics
2017. Project URL: https://rmcong.github.io/proj_RGBD_cosal_tcyb.htm
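The refinement cycle described in the abstract above (addition, deletion, iteration) can be illustrated with a toy sketch. This is not the authors' implementation: the region-level representation, the Gaussian feature similarity, the per-image normalization, and the parameters `alpha`, `sigma`, `max_iters`, and `tol` are all assumptions made for illustration, and the `depth_priors` values are a hypothetical stand-in for the depth shape prior.

```python
import math

def addition(sal, depth_prior, alpha=0.5):
    # Addition step (assumed form): raise saliency toward 1 where the depth prior is high.
    return [s + alpha * d * (1.0 - s) for s, d in zip(sal, depth_prior)]

def similarity(f, g, sigma=0.2):
    # Gaussian similarity between two region feature vectors (assumed choice).
    d2 = sum((a - b) ** 2 for a, b in zip(f, g))
    return math.exp(-d2 / (2.0 * sigma ** 2))

def deletion(feats, sals):
    # Deletion step (assumed form): keep a region only if some salient region
    # in the other images looks similar. Requires at least two images.
    out = []
    for i, (f_i, s_i) in enumerate(zip(feats, sals)):
        new = []
        for f_r, s_r in zip(f_i, s_i):
            supports = [
                max(similarity(f_r, f_q) * s_q
                    for f_q, s_q in zip(feats[j], sals[j]))
                for j in range(len(feats)) if j != i
            ]
            new.append(s_r * sum(supports) / len(supports))
        peak = max(new)
        # Rescale each image's map so that its strongest region has saliency 1.
        out.append([v / peak if peak > 0 else 0.0 for v in new])
    return out

def refine(feats, init_sals, depth_priors, max_iters=10, tol=1e-3):
    # Iteration step: repeat addition and deletion until the maps stop changing.
    sals = [list(s) for s in init_sals]
    for _ in range(max_iters):
        added = [addition(s, d) for s, d in zip(sals, depth_priors)]
        new = deletion(feats, added)
        change = max(abs(n - o)
                     for ns, os_ in zip(new, sals)
                     for n, o in zip(ns, os_))
        sals = new
        if change < tol:
            break
    return sals

# Two images, two regions each. The first region of each image shares a feature
# (the common object); the second regions differ between the images.
feats = [
    [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
    [(1.0, 0.0, 0.0), (0.0, 0.0, 1.0)],
]
init_sals = [[0.8, 0.8], [0.8, 0.8]]      # stand-in for single-image saliency maps
depth_priors = [[0.5, 0.5], [0.5, 0.5]]   # hypothetical depth prior per region
refined = refine(feats, init_sals, depth_priors)
```

In this toy setting the shared region keeps high saliency in both images while the image-specific regions are suppressed, which is the qualitative behaviour the abstract attributes to the deletion scheme.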
Global Context-Aware Progressive Aggregation Network for Salient Object Detection
Deep convolutional neural networks have achieved competitive performance in
salient object detection, in which how to learn effective and comprehensive
features plays a critical role. Most previous works adopted multi-level
feature integration yet ignored the gap between different features. In
addition, high-level features are diluted as they are passed along the
top-down pathway. To remedy these issues, we propose a
novel network named GCPANet to effectively integrate low-level appearance
features, high-level semantic features, and global context features through
some progressive context-aware Feature Interweaved Aggregation (FIA) modules
and generate the saliency map in a supervised way. Moreover, a Head Attention
(HA) module is used to reduce information redundancy and enhance the top-layer
features by leveraging spatial and channel-wise attention, and the Self
Refinement (SR) module is utilized to further refine and heighten the input
features. Furthermore, we design the Global Context Flow (GCF) module to
generate the global context information at different stages, which aims to
learn the relationship among different salient regions and alleviate the
dilution effect of high-level features. Experimental results on six benchmark
datasets demonstrate that the proposed approach outperforms the
state-of-the-art methods both quantitatively and qualitatively.
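As a rough illustration of what "spatial and channel-wise attention" can mean in the abstract above, the following generic sketch gates a small feature map first per channel and then per spatial position. It is not the HA module of GCPANet: the sigmoid gating, the weight matrix `w`, and the weight vector `v` are hypothetical choices, and a real network would learn such weights.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def channel_attention(feat, w):
    # feat: C channels, each an H x W list of lists; w: C x C weights (hypothetical).
    # Each channel is globally average-pooled, then rescaled by a gate in (0, 1).
    pooled = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feat]
    gates = [sigmoid(sum(wij * p for wij, p in zip(row, pooled))) for row in w]
    return [[[val * g for val in row] for row in ch] for ch, g in zip(feat, gates)]

def spatial_attention(feat, v):
    # v: C weights (hypothetical) that collapse the channels into one score per
    # position; every channel is then rescaled by the same per-position gate.
    h, wd = len(feat[0]), len(feat[0][0])
    gate = [[sigmoid(sum(vc * ch[y][x] for vc, ch in zip(v, feat)))
             for x in range(wd)] for y in range(h)]
    return [[[ch[y][x] * gate[y][x] for x in range(wd)] for y in range(h)]
            for ch in feat]

# A 2-channel, 2 x 2 feature map with made-up values and weights.
feat = [[[1.0, -2.0], [0.5, 3.0]],
        [[0.0, 1.0], [-1.0, 2.0]]]
w = [[0.5, -0.5], [0.25, 0.75]]
v = [1.0, -1.0]
out = spatial_attention(channel_attention(feat, w), v)
```

Because both gates lie strictly between 0 and 1, the output has the same shape as the input and every value is scaled down in magnitude, which is one simple way to suppress less informative channels and positions.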
Dense-Localizing Audio-Visual Events in Untrimmed Videos: A Large-Scale Benchmark and Baseline
Existing audio-visual event localization (AVE) handles manually trimmed
videos with only a single instance in each of them. However, this setting is
unrealistic as natural videos often contain numerous audio-visual events with
different categories. To better adapt to real-life applications, in this paper
we focus on the task of dense-localizing audio-visual events, which aims to
jointly localize and recognize all audio-visual events occurring in an
untrimmed video. The problem is challenging as it requires fine-grained
audio-visual scene and context understanding. To tackle this problem, we
introduce the first Untrimmed Audio-Visual (UnAV-100) dataset, which contains
10K untrimmed videos with over 30K audio-visual events. Each video has 2.8
audio-visual events on average, and the events are usually related to each
other and might co-occur as in real-life scenes. Next, we formulate the task
using a new learning-based framework, which is capable of fully integrating
audio and visual modalities to localize audio-visual events with various
lengths and capture dependencies between them in a single pass. Extensive
experiments demonstrate the effectiveness of our method as well as the
significance of multi-scale cross-modal perception and dependency modeling for
this task.
Comment: Accepted by CVPR202
RRNet: Relational Reasoning Network with Parallel Multi-scale Attention for Salient Object Detection in Optical Remote Sensing Images
Salient object detection (SOD) for optical remote sensing images (RSIs) aims
at locating and extracting visually distinctive objects/regions from the
optical RSIs. Although some saliency models have been proposed to address the
intrinsic problems of optical RSIs (such as complex backgrounds and
scale-variant objects), their accuracy and completeness are still
unsatisfactory. To this end, we propose
a relational reasoning network with parallel multi-scale attention for SOD in
optical RSIs in this paper. The relational reasoning module that integrates the
spatial and the channel dimensions is designed to infer the semantic
relationship by utilizing high-level encoder features, thereby promoting the
generation of more complete detection results. The parallel multi-scale
attention module is proposed to effectively restore the detail information and
address the scale variation of salient objects by using the low-level features
refined by multi-scale attention. Extensive experiments on two datasets
demonstrate that our proposed RRNet outperforms the existing state-of-the-art
SOD competitors both qualitatively and quantitatively.
Comment: 11 pages, 9 figures, Accepted by IEEE Transactions on Geoscience and
Remote Sensing 2021, project: https://rmcong.github.io/proj_RRNet.htm
SDDNet: Style-guided Dual-layer Disentanglement Network for Shadow Detection
Despite significant progress in shadow detection, current methods still
struggle with the adverse impact of background color, which may lead to errors
when shadows are present on complex backgrounds. Drawing inspiration from the
human visual system, we treat the input shadow image as a composition of a
background layer and a shadow layer, and design a Style-guided Dual-layer
Disentanglement Network (SDDNet) to model these layers independently. To
achieve this, we devise a Feature Separation and Recombination (FSR) module
that decomposes multi-level features into shadow-related and background-related
components by offering specialized supervision for each component, while
preserving information integrity and avoiding redundancy through the
reconstruction constraint. Moreover, we propose a Shadow Style Filter (SSF)
module to guide the feature disentanglement by focusing on style
differentiation and uniformization. With these two modules and our overall
pipeline, our model effectively minimizes the detrimental effects of background
color, yielding superior performance on three public datasets with a real-time
inference speed of 32 FPS.
Comment: Accepted by ACM MM 202